{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tokenizer（分词器）：\n",
    "\n",
    "      RegexTokenizer基于正则表达式提供更多的划分选项。默认情况下，参数“pattern”为划分文本的分隔符。或者，用户可以指定参数“gaps”来指明正则“patten”表示“tokens”而不是分隔符，这样来为分词结果找到所有可能匹配的情况。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import Tokenizer, RegexTokenizer\n",
    " \n",
    "sentenceDataFrame = spark.createDataFrame([\n",
    "    (0, \"Hi I heard about Spark\"),\n",
    "    (1, \"I wish Java could use case classes\"),\n",
    "    (2, \"Logistic,regression,models,are,neat\")\n",
    "], [\"label\", \"sentence\"])\n",
    "tokenizer = Tokenizer(inputCol=\"sentence\", outputCol=\"words\")\n",
    "wordsDataFrame = tokenizer.transform(sentenceDataFrame)\n",
    "for words_label in wordsDataFrame.select(\"words\", \"label\").take(3):\n",
    "    print(words_label)\n",
    "regexTokenizer = RegexTokenizer(inputCol=\"sentence\", outputCol=\"words\", pattern=\"\\\\W\")\n",
    "# alternatively, pattern=\"\\\\w+\", gaps(False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### StopWordsRemover\n",
    "\n",
    "    StopWordsRemover的输入为一系列字符串（如分词器输出），输出中删除了所有停用词。停用词表由stopWords参数提供。一些语言的默认停用词表可以通过StopWordsRemover.loadDefaultStopWords(language)调用。布尔参数caseSensitive指明是否区分大小写（默认为否）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import StopWordsRemover\n",
    " \n",
    "sentenceData = spark.createDataFrame([\n",
    "    (0, [\"I\", \"saw\", \"the\", \"red\", \"baloon\"]),\n",
    "    (1, [\"Mary\", \"had\", \"a\", \"little\", \"lamb\"])\n",
    "], [\"label\", \"raw\"])\n",
    " \n",
    "remover = StopWordsRemover(inputCol=\"raw\", outputCol=\"filtered\")\n",
    "remover.transform(sentenceData).show(truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### n-gram\n",
    "    \n",
    "     NGram的输入为一系列字符串（如分词器输出）。参数n决定每个n-gram包含的对象个数。结果包含一系列n-gram，其中每个n-gram代表一个空格分割的n个连续字符。如果输入少于n个字符串，将没有输出结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import NGram\n",
    " \n",
    "wordDataFrame = spark.createDataFrame([\n",
    "    (0, [\"Hi\", \"I\", \"heard\", \"about\", \"Spark\"]),\n",
    "    (1, [\"I\", \"wish\", \"Java\", \"could\", \"use\", \"case\", \"classes\"]),\n",
    "    (2, [\"Logistic\", \"regression\", \"models\", \"are\", \"neat\"])\n",
    "], [\"label\", \"words\"])\n",
    "ngram = NGram(inputCol=\"words\", outputCol=\"ngrams\")\n",
    "ngramDataFrame = ngram.transform(wordDataFrame)\n",
    "for ngrams_label in ngramDataFrame.select(\"ngrams\", \"label\").take(3):\n",
    "    print(ngrams_label)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Binarizer\n",
    "\n",
    "     二值化是根据阀值将连续数值特征转换为0-1特征的过程。\n",
    "\n",
    "Binarizer参数有输入、输出以及阀值。特征值大于阀值将映射为1.0，特征值小于等于阀值将映射为0.0。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import Binarizer\n",
    " \n",
    "continuousDataFrame = spark.createDataFrame([\n",
    "    (0, 0.1),\n",
    "    (1, 0.8),\n",
    "    (2, 0.2)\n",
    "], [\"label\", \"feature\"])\n",
    "binarizer = Binarizer(threshold=0.5, inputCol=\"feature\", outputCol=\"binarized_feature\")\n",
    "binarizedDataFrame = binarizer.transform(continuousDataFrame)\n",
    "binarizedFeatures = binarizedDataFrame.select(\"binarized_feature\")\n",
    "for binarized_feature, in binarizedFeatures.collect():\n",
    "    print(binarized_feature)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PCA\n",
    "\n",
    "    主成分分析是一种统计学方法，它使用正交转换从一系列可能相关的变量中提取线性无关变量集，提取出的变量集中的元素称为主成分。使用PCA方法可以对变量集合进行降维。下面的示例将会展示如何将5维特征向量转换为3维主成分向量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import PCA\n",
    "from pyspark.ml.linalg import Vectors\n",
    " \n",
    "data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),\n",
    "        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),\n",
    "        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]\n",
    "df = spark.createDataFrame(data, [\"features\"])\n",
    "pca = PCA(k=3, inputCol=\"features\", outputCol=\"pcaFeatures\")\n",
    "model = pca.fit(df)\n",
    "result = model.transform(df).select(\"pcaFeatures\")\n",
    "result.show(truncate=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### STringindexer\n",
    "    StringIndexer将字符串标签编码为标签指标。指标取值范围为[0,numLabels]，按照标签出现频率排序，所以出现最频繁的标签其指标为0。如果输入列为数值型，我们先将之映射到字符串然后再对字符串的值进行指标。如果下游的管道节点需要使用字符串－指标标签，则必须将输入和钻还为字符串－指标列名。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import StringIndexer\n",
    " \n",
    "df = spark.createDataFrame(\n",
    "    [(0, \"a\"), (1, \"b\"), (2, \"c\"), (3, \"a\"), (4, \"a\"), (5, \"c\")],\n",
    "    [\"id\", \"category\"])\n",
    "indexer = StringIndexer(inputCol=\"category\", outputCol=\"categoryIndex\")\n",
    "indexed = indexer.fit(df).transform(df)\n",
    "indexed.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## IndexToString\n",
    "\n",
    "    与StringIndexer对应，IndexToString将指标标签映射回原始字符串标签。一个常用的场景是先通过StringIndexer产生指标标签，然后使用指标标签进行训练，最后再对预测结果使用IndexToString来获取其原始的标签字符串。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import IndexToString, StringIndexer\n",
    " \n",
    "df = spark.createDataFrame(\n",
    "    [(0, \"a\"), (1, \"b\"), (2, \"c\"), (3, \"a\"), (4, \"a\"), (5, \"c\")],\n",
    "    [\"id\", \"category\"])\n",
    " \n",
    "stringIndexer = StringIndexer(inputCol=\"category\", outputCol=\"categoryIndex\")\n",
    "model = stringIndexer.fit(df)\n",
    "indexed = model.transform(df)\n",
    " \n",
    "converter = IndexToString(inputCol=\"categoryIndex\", outputCol=\"originalCategory\")\n",
    "converted = converter.transform(indexed)\n",
    " \n",
    "converted.select(\"id\", \"originalCategory\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### OneHotEncoder\n",
    "\n",
    "    独热编码将标签指标映射为二值向量，其中最多一个单值。这种编码被用于将种类特征使用到需要连续特征的算法，如逻辑回归等。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import OneHotEncoder, StringIndexer\n",
    " \n",
    "df = spark.createDataFrame([\n",
    "    (0, \"a\"),\n",
    "    (1, \"b\"),\n",
    "    (2, \"c\"),\n",
    "    (3, \"a\"),\n",
    "    (4, \"a\"),\n",
    "    (5, \"c\")\n",
    "], [\"id\", \"category\"])\n",
    " \n",
    "stringIndexer = StringIndexer(inputCol=\"category\", outputCol=\"categoryIndex\")\n",
    "model = stringIndexer.fit(df)\n",
    "indexed = model.transform(df)\n",
    "encoder = OneHotEncoder(dropLast=False, inputCol=\"categoryIndex\", outputCol=\"categoryVec\")\n",
    "encoded = encoder.transform(indexed)\n",
    "encoded.select(\"id\", \"categoryVec\").show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
